Some Information-Theoretic Computations 
Related to the Distribution of Prime Numbers 



Submitted to Jorma Rissanen's Festschrift volume 



I. Kontoyiannis* 
February 2, 2008 



Abstract 

We illustrate how elementary information-theoretic ideas may be employed to provide 
proofs for well-known, nontrivial results in number theory. Specifically, we give an elementary 
and fairly short proof of the following asymptotic result, 

$$\sum_{p \le n} \frac{\log p}{p} \sim \log n, \quad \text{as } n \to \infty,$$

where the sum is over all primes p not exceeding n. We also give finite-n bounds refining 
the above limit. This result, originally proved by Chebyshev in 1852, is closely related to 
the celebrated prime number theorem. 
1 Introduction



The significant depth of the connection between information theory and statistics appears to 
have been recognized very soon after the birth of information theory [17] in 1948; a book-length 
exposition was provided by Kullback [12] already in 1959. In subsequent decades much was 
accomplished, and in the 1980s the development of this connection culminated in Rissanen's 
celebrated work [14] [15] [16], laying the foundations for the notion of stochastic complexity and 
the Minimum Description Length principle, or MDL. 

Here we offer a first glimpse of a different connection, this time between information theory 
and number theory. In particular, we will show that basic information-theoretic arguments 
combined with elementary computations can be used to give a new proof for a classical result 
concerning the distribution of prime numbers. The problem of understanding this "distribution" 
(including the issue of exactly what is meant by that statement) has, of course, been at the heart 
of mathematics since antiquity, and it has led, among other things, to the development of the 
field of analytic number theory; e.g., Apostol's text [1] offers an accessible introduction and [2] 
gives a more historical perspective. 

A major subfield is probabilistic number theory, where probabilistic tools are used to derive 
results in number theory. This approach, pioneered by, among others, Mark Kac and Paul Erdős
from the 1930s on, is described, e.g., in Kac's beautiful book [11], Billingsley's review [3], and 
Tenenbaum's more recent text [18]. The starting point in much of the relevant literature is the 
following setup: For a fixed, large integer n, choose a random integer N from {1,2,..., n}, and 
write it in its unique prime factorization, 

$$N = \prod_{p \le n} p^{X_p}, \qquad (1)$$

where the product runs over all primes $p$ not exceeding $n$, and $X_p$ is the largest power $k \ge 0$
such that $p^k$ divides $N$. Through this representation, the uniform distribution on $N$ induces a
joint distribution on the $\{X_p;\ p \le n\}$, and the key observation is that, for large $n$, the random
variables $\{X_p\}$ are distributed approximately like independent geometrics. Indeed, since there
are exactly $\lfloor n/p^k \rfloor$ multiples of $p^k$ between 1 and $n$,



$$\Pr\{X_p \ge k\} = \Pr\{N \text{ is a multiple of } p^k\} = \frac{1}{n}\left\lfloor \frac{n}{p^k} \right\rfloor \approx \left(\frac{1}{p}\right)^{k}, \quad \text{for large } n, \qquad (2)$$

so the distribution of $X_p$ is approximately geometric. Similarly, for the joint distribution of the
$\{X_p\}$ we find,



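$$\Pr\{X_{p_i} \ge k_{p_i} \text{ for primes } p_1, p_2, \ldots, p_m \le n\} = \frac{1}{n}\left\lfloor \frac{n}{\prod_{i=1}^{m} p_i^{k_{p_i}}} \right\rfloor \approx \prod_{i=1}^{m} \left(\frac{1}{p_i}\right)^{k_{p_i}},$$

showing that the $\{X_p\}$ are approximately independent.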
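
As an informal numerical illustration of (2) and of the approximate independence, the following Python sketch compares exact tail probabilities with their geometric approximations. It is only a sanity check under the setup above; the function names `tail_prob` and `joint_tail_prob` are ours and purely illustrative.

```python
def tail_prob(n, p, k):
    """Exact Pr{X_p >= k} for N uniform on {1, ..., n}: floor(n / p^k) / n."""
    return (n // p**k) / n


def joint_tail_prob(n, prime_powers):
    """Exact Pr{X_p >= k for every (p, k) in prime_powers}; the primes p must be distinct."""
    d = 1
    for p, k in prime_powers:
        d *= p**k
    # N satisfies all the constraints exactly when it is a multiple of d.
    return (n // d) / n


n = 10**6
for p, k in [(2, 1), (2, 5), (3, 2), (7, 3)]:
    # Exact tail probability next to the geometric approximation (1/p)^k.
    print(p, k, tail_prob(n, p, k), (1 / p) ** k)

# Joint tail next to the product of the marginal approximations.
print(joint_tail_prob(n, [(2, 3), (5, 2)]), (1 / 2) ** 3 * (1 / 5) ** 2)
```

In each case the exact value and the approximation differ by at most $1/n$, since $\lfloor x \rfloor$ and $x$ differ by less than 1.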

This elegant approximation is also mathematically powerful, as it makes it possible to trans- 
late standard results about collections of independent random variables into important properties 
that hold for every "typical" integer N. Billingsley in his 1973 Wald Memorial Lectures [3] gives 
an account of the state-of-the-art of related results up to that point, but he also goes on to make 
a further, fascinating connection with the entropy of the random variables $\{X_p\}$.
Billingsley's argument essentially begins with the observation that, since the representation (1)
is unique, the value of $N$ and the values of the exponents $\{X_p\}$ are in a one-to-one
correspondence; therefore, the entropy of $N$ is the same as the entropy of the collection $\{X_p\}$,${}^{1}$

$$\log n = H(N) = H(X_p;\ p \le n).$$

And since the random variables $\{X_p\}$ are approximately independent geometrics, we should
expect that,

$$\log n = H(X_p;\ p \le n) \approx \sum_{p \le n} H(X_p) \approx \sum_{p \le n} \left[ \frac{\log p}{p-1} - \log\left(1 - \frac{1}{p}\right) \right], \qquad (3)$$

where in the last step we simply substituted the well-known expression for the entropy of
a geometric random variable (see Section 2 for details on the definition of the entropy and its
computation). For large $p$, the above summands behave like $\frac{\log p}{p}$ to first order, leading to the
asymptotic estimate,

$$\sum_{p \le n} \frac{\log p}{p} \approx \log n, \quad \text{for large } n.$$



Our main goal in this paper is to show that this approximation can indeed be made rigorous, 
mostly through elementary information-theoretic arguments; we will establish: 

Theorem 1. As $n \to \infty$,

$$C(n) := \sum_{p \le n} \frac{\log p}{p} \sim \log n, \qquad (4)$$

where the sum is over all primes $p$ not exceeding $n$.${}^{2}$

As described in more detail in the following section, the fact that the joint distribution of the
$\{X_p\}$ is asymptotically close to the distribution of independent geometrics is not sufficient to
turn Billingsley's heuristic into an actual proof - at least, we were not able to make the two "$\approx$"
steps in (3) rigorous directly. Instead, we provide a proof in two steps. We modify Billingsley's
heuristic to derive a lower bound on C(n) in Theorem 2, and in Theorem 3 we use a different 
argument, again going via the entropy of N, to compute a corresponding upper bound. These 
two combined prove Theorem 1, and they also give finer, finite-n bounds on C(n). 
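
To get a feel for Theorem 1, one can tabulate $C(n)$ against $\log n$ numerically. The following is a minimal Python sketch, not part of the proof; the helper names `primes_up_to` and `C` are ours, and a plain sieve of Eratosthenes is assumed to be adequate for the modest values of $n$ used.

```python
import math


def primes_up_to(n):
    """Sieve of Eratosthenes: all primes p <= n."""
    if n < 2:
        return []
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, math.isqrt(n) + 1):
        if sieve[i]:
            # Cross out the multiples of i, starting from i*i.
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(2, n + 1) if sieve[i]]


def C(n):
    """C(n) = sum of log(p)/p over primes p <= n, as in (4)."""
    return sum(math.log(p) / p for p in primes_up_to(n))


for n in (10**2, 10**4, 10**6):
    # Columns: n, C(n), log n, and the ratio C(n)/log n.
    print(n, round(C(n), 4), round(math.log(n), 4), round(C(n) / math.log(n), 4))
```

One should expect the ratio to approach 1 only slowly: by the classical estimate of Mertens the difference $C(n) - \log n$ stays bounded, so the ratio differs from 1 by a term of order $1/\log n$.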

In Section 2 we state our main results and describe the intuition behind their proofs. We 
also briefly review some other elegant information-theoretic arguments connected with bounds 
on the number of primes up to n. The appendix contains the remaining proofs. 

Before moving on to the results themselves, a few words about the history of Theorem 1 
are in order. The relationship (4) was first proved by Chebyshev [7] [6] in 1852, where he also 
produced finite-n bounds on C(n), with explicit constants. Chebyshev's motivation was to 
prove the celebrated prime number theorem (PNT), stating that $\pi(n)$, the number of primes
not exceeding $n$, grows like,

$$\pi(n) \sim \frac{n}{\log n}, \quad \text{as } n \to \infty.$$



1 For definiteness, we take log to denote the natural logarithm to base e throughout, although the choice of the 
base of the logarithm is largely irrelevant for our considerations. 

2 As usual, the notation "$a_n \sim b_n$ as $n \to \infty$" means that $\lim_{n \to \infty} a_n/b_n = 1$.
This was conjectured by Gauss around 1792, and it was only proved in 1896; Chebyshev was
not able to produce a complete proof, but he used (4) and his finer bounds on $C(n)$ to show
that $\pi(n)$ is of order $n/\log n$. Although we will not pursue this direction here, it is actually not
hard to see that the asymptotic behavior of $C(n)$ is intimately connected with that of $\pi(n)$. For
example, a simple exercise in summation by parts shows that $\pi(n)$ can be expressed directly in
terms of $C(n)$:

$$\pi(n) = \frac{n+1}{\log(n+1)}\, C(n) - \sum_{k=2}^{n} \left( \frac{k+1}{\log(k+1)} - \frac{k}{\log k} \right) C(k), \quad \text{for all } n \ge 3. \qquad (5)$$

For the sake of completeness, this is proved in the appendix. 
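
The identity (5) is easy to check numerically for small $n$. The Python sketch below is only such a check, under the assumption that trial division is fast enough for the range tested; the names `pi_and_C` and `pi_from_C` are illustrative and not from the text.

```python
import math


def pi_and_C(n):
    """Return lists pi[k] and C[k] for k = 0, ..., n, using trial division."""
    pi = [0] * (n + 1)
    C = [0.0] * (n + 1)
    for k in range(2, n + 1):
        is_prime = all(k % d for d in range(2, math.isqrt(k) + 1))
        pi[k] = pi[k - 1] + (1 if is_prime else 0)
        C[k] = C[k - 1] + (math.log(k) / k if is_prime else 0.0)
    return pi, C


def pi_from_C(n, C):
    """Right-hand side of (5), computed from the values C(2), ..., C(n)."""
    def g(k):
        return k / math.log(k)

    return g(n + 1) * C[n] - sum((g(k + 1) - g(k)) * C[k] for k in range(2, n + 1))


pi, C = pi_and_C(1000)
for n in (3, 10, 100, 1000):
    # The two columns should agree up to floating-point rounding.
    print(n, pi[n], pi_from_C(n, C))
```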

The PNT was finally proved in 1896 by Hadamard and by de la Vallée-Poussin. Their proofs
were not elementary - both relied on the use of Hadamard's theory of integral functions applied
to the Riemann zeta function $\zeta(s)$; see [2] for some details. In fact, for quite some time it was
believed that no elementary proof would ever be found, and G.H. Hardy in a famous lecture to
the Mathematical Society of Copenhagen in 1921 [4] went as far as to suggest that "if anyone
produces an elementary proof of the PNT ... he will show that ... it is time for the books to be
cast aside and for the theory to be rewritten." It is, therefore, not surprising that Selberg and
Erdős' announcement in 1948 that they had produced such an elementary proof caused a great
sensation in the mathematical world; see [9] for a survey. In our context, it is interesting to note 
that Chebyshev's result is again used explicitly in one of the steps of this elementary proof. 

Finally we remark that, although the simple arguments in this work fall short of giving 
estimates precise enough for an elementary information-theoretic proof of the PNT, it may not 
be entirely unreasonable to hope that such a proof may exist. 



2 Primes and Bits: Heuristics and Results 
2.1 Preliminaries 

For a fixed (typically large) $n \ge 2$, our starting point is the setting described in the introduction.
Take $N$ to be a uniformly distributed integer in $\{1, 2, \ldots, n\}$ and write it in its unique prime
factorization as in (1),

$$N = \prod_{p \le n} p^{X_p} = p_1^{X_{p_1}} p_2^{X_{p_2}} \cdots p_{\pi(n)}^{X_{p_{\pi(n)}}},$$

where $\pi(n)$ denotes the number of primes $p_1, p_2, \ldots, p_{\pi(n)}$ up to $n$, and $X_p$ is the largest integer
power $k \ge 0$ such that $p^k$ divides $N$. As noted in (2) above, the distribution of $X_p$ can be
described by,

$$\Pr\{X_p \ge k\} = \frac{1}{n} \left\lfloor \frac{n}{p^k} \right\rfloor, \quad \text{for all } k \ge 1. \qquad (6)$$

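This representation also gives simple upper and lower bounds on its mean $E(X_p)$,

$$\mu_p := E(X_p) = \sum_{k \ge 1} \Pr\{X_p \ge k\} \le \sum_{k \ge 1} \left(\frac{1}{p}\right)^{k} = \frac{1/p}{1 - 1/p} = \frac{1}{p-1}, \qquad (7)$$

$$\text{and} \quad \mu_p \ge \Pr\{X_p \ge 1\} \ge \frac{1}{p} - \frac{1}{n}. \qquad (8)$$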
Recall the important observation that the distribution of each $X_p$ is close to a geometric.
To be precise, a random variable $Y$ with values in $\{0, 1, 2, \ldots\}$ is said to have a geometric
distribution with mean $\mu > 0$, denoted $Y \sim \mathrm{Geom}(\mu)$, if $\Pr\{Y = k\} = \mu^k/(1+\mu)^{k+1}$, for all
$k \ge 0$. Then $Y$ of course has mean $E(Y) = \mu$ and its entropy is,

$$h(\mu) := H(\mathrm{Geom}(\mu)) = -\sum_{k \ge 0} \Pr\{Y = k\} \log \Pr\{Y = k\} = (\mu+1)\log(\mu+1) - \mu \log \mu. \qquad (9)$$

See, e.g., [8] for the standard properties of the entropy. 
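
The closed form (9) can be checked against a direct (truncated) evaluation of the entropy sum. The following Python sketch does this, and also confirms that $h(1/(p-1))$ agrees with the summand appearing in (3). The function names and the truncation level `terms` are our own illustrative choices.

```python
import math


def h(mu):
    """Closed-form entropy (in nats) of Geom(mu), as in (9)."""
    return (mu + 1) * math.log(mu + 1) - mu * math.log(mu)


def geom_entropy_numeric(mu, terms=2000):
    """Truncated sum of -P(k) log P(k), with P(k) = mu^k / (1 + mu)^(k + 1)."""
    q = mu / (1 + mu)
    total = 0.0
    for k in range(terms):
        pk = (1 - q) * q**k  # same value as mu^k / (1 + mu)^(k + 1)
        if pk == 0.0:  # underflow: the remaining terms are negligible
            break
        total -= pk * math.log(pk)
    return total


for mu in (0.1, 0.5, 1.0, 3.0):
    print(mu, h(mu), geom_entropy_numeric(mu))

# With mu = 1/(p - 1), h(mu) should match log(p)/(p - 1) - log(1 - 1/p).
for p in (2, 3, 5, 7):
    print(p, h(1 / (p - 1)), math.log(p) / (p - 1) - math.log(1 - 1 / p))
```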

2.2 Billingsley's Heuristic and Lower Bounds on C(n) 

First we show how Billingsley's heuristic can be modified to yield a lower bound on C(n). 
Arguing as in the introduction, 

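$$\log n \overset{(a)}{=} H(N) \overset{(b)}{=} H(X_p;\ p \le n) \overset{(c)}{\le} \sum_{p \le n} H(X_p) \overset{(d)}{\le} \sum_{p \le n} H(\mathrm{Geom}(\mu_p)) \overset{(e)}{=} \sum_{p \le n} h(\mu_p), \qquad (10)$$

where (a) is simply the entropy of the uniform distribution, (b) comes from the fact that $N$
and the $\{X_p\}$ are in a one-to-one correspondence, (c) is the well-known subadditivity of the
entropy, (d) is because the geometric has maximal entropy among all distributions on the
nonnegative integers with a fixed mean, and (e) is the definition of $h(\mu)$ in (9). Noting that $h(\mu)$ is
nondecreasing in $\mu$ and recalling the upper bound on $\mu_p$ in (7) gives,

$$\log n \le \sum_{p \le n} h(\mu_p) \le \sum_{p \le n} h\big(1/(p-1)\big) = \sum_{p \le n} \left[ \frac{p}{p-1} \log\left(\frac{p}{p-1}\right) - \frac{1}{p-1} \log\left(\frac{1}{p-1}\right) \right]. \qquad (11)$$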

Rearranging the terms in the sum proves:
Theorem 2. For all $n \ge 2$,

$$T(n) := \sum_{p \le n} \left[ \frac{\log p}{p-1} - \log\left(1 - \frac{1}{p}\right) \right] \ge \log n.$$



Since the summands above behave like $\frac{\log p}{p}$ for large $p$, it is not difficult to deduce the following
lower bounds on $C(n) = \sum_{p \le n} \frac{\log p}{p}$:
Corollary 1. [Lower Bounds on C(n)]

(i) $\displaystyle \liminf_{n \to \infty} \frac{C(n)}{\log n} \ge 1$;

(ii) $\displaystyle C(n) \ge \frac{86}{125} \log n - 2.35$, for all $n \ge 16$.

Corollary 1 is proved in the appendix. Part (i) proves half of Theorem 1, and (ii) is a simple
evaluation of the more general bound derived in equation (15) in the proof: For any $N_0 \ge 2$, we
have,

$$C(n) \ge \left(1 - \frac{1}{N_0}\right)\left(1 - \frac{1}{1 + \log N_0}\right) \log n + C(N_0) - T(N_0), \quad \text{for all } n \ge N_0.$$
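
The bounds of Theorem 2 and Corollary 1 can be checked numerically for small $n$. The sketch below evaluates $T(n)$, $C(n)$ and the general lower bound for a given $N_0$; it is a sanity check only, the helper names are ours, and trial division is assumed to be fast enough for the ranges used.

```python
import math


def primes_up_to(n):
    """All primes p <= n by trial division (adequate for small n)."""
    return [k for k in range(2, n + 1) if all(k % d for d in range(2, math.isqrt(k) + 1))]


def T(n):
    """T(n) from Theorem 2."""
    return sum(math.log(p) / (p - 1) - math.log(1 - 1 / p) for p in primes_up_to(n))


def C(n):
    """C(n) = sum of log(p)/p over primes p <= n."""
    return sum(math.log(p) / p for p in primes_up_to(n))


def general_lower_bound(n, N0):
    """Right-hand side of the general bound above, for n >= N0 >= 2."""
    coeff = (1 - 1 / N0) * (1 - 1 / (1 + math.log(N0)))
    return coeff * math.log(n) + C(N0) - T(N0)


for n in (16, 100, 1000):
    # Columns: n, log n, T(n), C(n), general bound with N0 = 16, bound (ii).
    print(n, math.log(n), T(n), C(n), general_lower_bound(n, 16), 86 / 125 * math.log(n) - 2.35)
```

Evaluating the coefficient and the constant at $N_0 = 16$ shows where the rounded values $86/125$ and $2.35$ in part (ii) come from.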
2.3 A Simple Upper Bound on C(n)



Unfortunately, it is not clear how to reverse the inequalities in equations (10) and (11) to get 
a corresponding upper bound on C(n) - especially inequality (c) in (10). Instead we use a 
different argument, one which is less satisfying from an information-theoretic point of view, for 
two reasons. First, although again we do go via the entropy of N, it is not necessary to do so; 
see equation (13) below. And second, we need to use an auxiliary result, namely, the following 
rough estimate on the sum, $\vartheta(n) := \sum_{p \le n} \log p$:

$$\vartheta(n) := \sum_{p \le n} \log p \le (2 \log 2)\, n, \quad \text{for all } n \ge 2. \qquad (12)$$

For completeness, it is proved at the end of this section. 

To obtain an upper bound on $C(n)$, we note that the entropy of $N$, $H(N) = \log n$, can
be expressed in an alternative form: Let $Q$ denote the probability mass function of $N$, so that
$Q(k) = 1/n$ for all $1 \le k \le n$. Since $N \le n = 1/Q(N)$ always, we have,

$$H(N) = E[-\log Q(N)] \ge E[\log N] = E\Big[\log \prod_{p \le n} p^{X_p}\Big] = \sum_{p \le n} E(X_p) \log p. \qquad (13)$$

Therefore, recalling (8) and using the bound (12), 

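$$\log n \ge \sum_{p \le n} \left(\frac{1}{p} - \frac{1}{n}\right) \log p = \sum_{p \le n} \frac{\log p}{p} - \frac{\vartheta(n)}{n} \ge \sum_{p \le n} \frac{\log p}{p} - 2 \log 2,$$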

thus proving: 

Theorem 3. [Upper Bound] For all $n \ge 2$,

$$\sum_{p \le n} \frac{\log p}{p} \le \log n + 2 \log 2.$$



Theorem 3 together with Corollary 1 prove Theorem 1. Of course the use of the entropy could 
have been avoided entirely: Instead of using that $H(N) = \log n$ in (13), we could simply use
that $n \ge N$ by definition, so $\log n \ge E[\log N]$, and proceed as before.
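
The chain of inequalities leading to Theorem 3 can be traced numerically. The Python sketch below computes $E(X_p)$ exactly from (6), forms $E[\log N] = \sum_{p \le n} E(X_p) \log p$ as in (13), and returns the four quantities that should appear in decreasing order. The names `mean_exponent` and `upper_bound_chain` are illustrative assumptions of ours.

```python
import math


def primes_up_to(n):
    """All primes p <= n by trial division (adequate for small n)."""
    return [k for k in range(2, n + 1) if all(k % d for d in range(2, math.isqrt(k) + 1))]


def mean_exponent(n, p):
    """Exact E(X_p): the sum over k >= 1 of floor(n / p^k) / n, using (6)."""
    total, q = 0, p
    while q <= n:
        total += n // q
        q *= p
    return total / n


def upper_bound_chain(n):
    """Return (log n, E[log N], C(n) - theta(n)/n, C(n) - 2 log 2)."""
    ps = primes_up_to(n)
    e_log_N = sum(mean_exponent(n, p) * math.log(p) for p in ps)  # as in (13)
    theta = sum(math.log(p) for p in ps)
    C = sum(math.log(p) / p for p in ps)
    return math.log(n), e_log_N, C - theta / n, C - 2 * math.log(2)


for n in (10, 100, 1000):
    print(n, upper_bound_chain(n))
```

Since $E[\log N] = \log(n!)/n$, the second entry can also be cross-checked against `math.lgamma(n + 1) / n`.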

Finally (paraphrasing from [10, p. 341]) we give an elegant argument of Erdős that employs
a cute, elementary trick to prove the inequality on $\vartheta(n)$ in (12). First observe that we can
restrict attention to odd $n$, since $\vartheta(2n) = \vartheta(2n-1)$, for all $n \ge 2$ (as there are no even primes
other than 2). Let $n \ge 2$ be arbitrary; then every prime $n+1 < p \le 2n+1$ divides the binomial
coefficient,

$$B := \binom{2n+1}{n} = \frac{(2n+1)!}{n!\,(n+1)!},$$

since it divides the numerator but not the denominator, and hence the product of all these
primes also divides $B$. In particular, their product must be no greater than $B$, i.e.,

$$\prod_{n+1 < p \le 2n+1} p \le B = \frac{1}{2}\binom{2n+1}{n} + \frac{1}{2}\binom{2n+1}{n+1} \le \frac{1}{2}(1+1)^{2n+1} = 2^{2n},$$
or, taking logarithms,

$$\vartheta(2n+1) - \vartheta(n+1) = \sum_{n+1 < p \le 2n+1} \log p = \log \prod_{n+1 < p \le 2n+1} p \le (2 \log 2)\, n.$$

Iterating this bound inductively gives the required result.
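
Both the divisibility step in the argument above and the resulting bound (12) are easy to test by machine for small $n$. The following Python sketch does so with exact integer arithmetic for the binomial coefficient; the function names are ours, and `math.comb` and `math.prod` assume Python 3.8 or later.

```python
import math


def is_prime(k):
    """Trial-division primality test (adequate for small k)."""
    return k >= 2 and all(k % d for d in range(2, math.isqrt(k) + 1))


def erdos_step_holds(n):
    """Check that the primes in (n+1, 2n+1] multiply to a divisor of B, and that B <= 2^(2n)."""
    B = math.comb(2 * n + 1, n)
    product = math.prod(p for p in range(n + 2, 2 * n + 2) if is_prime(p))
    return B % product == 0 and product <= B <= 4**n


def theta(n):
    """theta(n) = sum of log p over primes p <= n."""
    return sum(math.log(p) for p in range(2, n + 1) if is_prime(p))


print(all(erdos_step_holds(n) for n in range(2, 200)))
print(all(theta(n) <= 2 * math.log(2) * n for n in range(2, 500)))
```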



2.4 Other Information-Theoretic Bounds on the Primes 

Billingsley in his 1973 Wald Memorial Lectures [3] appears to have been the first to connect 
the entropy with properties of the asymptotic distribution of the primes. Although there are 
no results in that work based on information-theoretic arguments, he does suggest the heuristic 
upon which part of our proof of Theorem 2 was based, and he also goes in the opposite direction: 
He uses probabilistic techniques and results about the primes to compute the entropy of several 
relevant collections of random variables. 

Chaitin in 1979 [5] gave a proof of the fact that there are infinitely many primes, using 
algorithmic information theory. Essentially the same argument proves a slightly stronger result, 
namely that, $\pi(n) \ge \frac{\log n}{\log\log n + 1}$, for all $n \ge 3$. Chaitin's proof can easily be translated into
our setting as follows. Recall the representation (1) of a uniformly distributed integer $N$ in
$\{1, 2, \ldots, n\}$. Since $p^{X_p}$ divides $N$, we must have $p^{X_p} \le n$, so that each $X_p$ lies in the range,

$$0 \le X_p \le \left\lfloor \frac{\log n}{\log p} \right\rfloor \le \frac{\log n}{\log p},$$

and hence, $H(X_p) \le \log\left(\frac{\log n}{\log p} + 1\right)$. Therefore, arguing as before,

$$\log n = H(N) = H(X_p;\ p \le n) \le \sum_{p \le n} H(X_p) \le \sum_{p \le n} \log\left(\frac{\log n}{\log p} + 1\right) \le \pi(n)\,(\log\log n + 1),$$

where the last inequality holds for all $n \ge 3$.

It is interesting that the same argument applied to a different representation for N yields a 
marginally better bound: Suppose we write, 

$$N = M^2 \prod_{p \le n} p^{Y_p},$$

where $M \ge 1$ is the largest integer such that $M^2$ divides $N$, and each of the $Y_p$ are either zero
or one. Then $H(Y_p) \le \log 2$ for all $p$, and the fact that $M^2 \le n$ implies that $H(M) \le \log\lfloor\sqrt{n}\rfloor$.
Therefore,

$$\log n = H(N) = H(M, Y_{p_1}, Y_{p_2}, \ldots, Y_{p_{\pi(n)}}) \le H(M) + \sum_{p \le n} H(Y_p) \le \frac{1}{2} \log n + \pi(n) \log 2,$$

which implies that $\pi(n) \ge \frac{\log n}{2 \log 2}$, for all $n \ge 2$.
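
The representation $N = M^2 \prod_p p^{Y_p}$ and the two lower bounds on $\pi(n)$ discussed in this section can be illustrated with a short Python sketch. It uses the counting form of the entropy argument (there are at most $\lfloor\sqrt{n}\rfloor \cdot 2^{\pi(n)}$ possible pairs of $M$ and $(Y_p)$, and the map from $N$ is one-to-one). The helper names `square_part`, `prime_count` and `counting_bounds_hold` are hypothetical and chosen only for this illustration.

```python
import math


def square_part(N):
    """Return (M, S) with N = M^2 * S, where S is squarefree and M is as large as possible."""
    M, S, d = 1, N, 2
    while d * d <= S:
        while S % (d * d) == 0:
            S //= d * d
            M *= d
        d += 1
    return M, S


def prime_count(n):
    """pi(n) by trial division (adequate for small n)."""
    return sum(1 for k in range(2, n + 1) if all(k % d for d in range(2, math.isqrt(k) + 1)))


def counting_bounds_hold(n):
    """Check the counting inequality and both lower bounds on pi(n), for n >= 2."""
    pi_n = prime_count(n)
    # N -> (M, Y) is one-to-one, with M <= sqrt(n) and Y in {0, 1}^pi(n).
    squares_ok = math.isqrt(n) * 2**pi_n >= n
    log_bound_ok = pi_n >= math.log(n) / (2 * math.log(2))
    chaitin_ok = n < 3 or pi_n >= math.log(n) / (math.log(math.log(n)) + 1)
    return squares_ok and log_bound_ok and chaitin_ok


print(square_part(72))  # 72 = 6^2 * 2
print(all(counting_bounds_hold(n) for n in range(2, 300)))
```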

Finally we mention that in Li and Vitányi's text [13], an elegant argument is given for
a more accurate lower bound on $\pi(n)$. Using ideas and results from algorithmic information
theory, they show that, $\pi(n) = \Omega\big(\frac{n}{(\log n)^2}\big)$. But the proof (which they attribute to unpublished
work by P. Berman (1987) and J. Tromp (1990)) is somewhat involved, and uses tools very 
different to those developed here. 
Appendix

Proof of the Summation-by-parts Formula (5). Note that, since $\pi(k) - \pi(k-1)$ is zero
unless $k$ is prime, $C(n)$ can be expressed as a sum over all integers $k \le n$,

$$C(n) = \sum_{2 \le k \le n} [\pi(k) - \pi(k-1)] \frac{\log k}{k}.$$

Each of the following steps is obvious, giving,

$$\begin{aligned}
\pi(n) &= \sum_{k=2}^{n} [\pi(k) - \pi(k-1)] \\
&= \sum_{k=2}^{n} [\pi(k) - \pi(k-1)] \, \frac{\log k}{k} \, \frac{k}{\log k} \\
&= \sum_{k=2}^{n} [C(k) - C(k-1)] \, \frac{k}{\log k} \\
&= \sum_{k=2}^{n} C(k) \frac{k}{\log k} - \sum_{k=1}^{n-1} C(k) \frac{k+1}{\log(k+1)} \\
&= \frac{n+1}{\log(n+1)}\, C(n) - \sum_{k=2}^{n} \left( \frac{k+1}{\log(k+1)} - \frac{k}{\log k} \right) C(k) - \frac{2}{\log 2}\, C(1),
\end{aligned}$$

as claimed, since $C(1) = 0$, by definition. □
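Proof of Corollary 1. Choose and fix any $N_0 \ge 2$ and let $n \ge N_0$ be arbitrary. Then,

$$T(n) = T(N_0) + \sum_{N_0 < p \le n} \left[ \frac{\log p}{p-1} - \log\left(1 - \frac{1}{p}\right) \right] \le T(N_0) + \sum_{N_0 < p \le n} \left[ \frac{\log p}{p-1} + \frac{1}{p}\,\frac{1}{1 - 1/N_0} \right],$$

where the last inequality follows from the inequality $-\log(1-x) \le x/(1-\delta)$, for all $0 \le x \le \delta < 1$,
with $\delta = 1/N_0$. Therefore, using Theorem 2 together with the bounds $\frac{p}{p-1} \le \frac{N_0}{N_0-1}$ and
$\frac{1}{p} \le \frac{1}{\log N_0}\frac{\log p}{p}$ for $p > N_0$,

$$\log n \le T(N_0) + \sum_{N_0 < p \le n} \left[ \frac{1}{1 - 1/N_0}\,\frac{\log p}{p} + \frac{N_0}{N_0 - 1}\,\frac{1}{\log N_0}\,\frac{\log p}{p} \right] = T(N_0) + \left(\frac{N_0}{N_0-1}\right)\left(1 + \frac{1}{\log N_0}\right)\big(C(n) - C(N_0)\big). \qquad (14)$$

Dividing by $\log n$ and letting $n \to \infty$ yields,

$$\liminf_{n \to \infty} \frac{C(n)}{\log n} \ge \frac{(N_0 - 1) \log N_0}{N_0 (1 + \log N_0)},$$

and since $N_0$ was arbitrary, letting now $N_0 \to \infty$ implies (i).
For all $n \ge N_0$, (14) implies,

$$C(n) \ge \left(1 - \frac{1}{N_0}\right)\left(1 - \frac{1}{1 + \log N_0}\right) \log n + C(N_0) - T(N_0), \qquad (15)$$

and evaluating this at $N_0 = 16$ gives (ii). □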